17  Combining Datasets

17.1 Introduction: Why Combine Datasets?

In real-world data analysis, information is rarely contained in a single dataset. You’ll often need to combine multiple datasets from different sources to get a complete picture. This process is fundamental to data analysis and goes by several names:

  • Merging - joining datasets based on common keys
  • Joining - combining datasets using indices
  • Concatenating - stacking datasets vertically or horizontally

17.1.1 Common Scenarios Requiring Data Combination

1. Distributed Information

  • Customer demographics in one table, purchase history in another
  • Product details separate from sales transactions
  • Survey responses split across multiple files

2. Data from Multiple Sources

  • Internal company data + external market data
  • Multiple databases (HR, Sales, Inventory)
  • Different time periods stored in separate files

3. Analytical Requirements

  • Enriching existing data with additional attributes
  • Comparing data across different groups or time periods
  • Building comprehensive datasets for machine learning

17.1.2 The Three Main Combination Methods

Pandas provides three powerful functions for combining datasets, each suited to different scenarios:

Method        Function    Use Case                                               Key Parameter
------------  ----------  -----------------------------------------------------  -------------------
Merge         merge()     Combine datasets with common columns (like SQL JOIN)   on= (column name)
Join          join()      Combine datasets using their indices                   lsuffix=, rsuffix=
Concatenate   concat()    Stack datasets vertically or horizontally              axis= (0 or 1)

In this chapter, we’ll explore each method with practical examples, understand when to use which approach, and learn how to handle common challenges like missing values and duplicate columns.

17.1.3 Setup: Loading Required Libraries

Let’s start by importing the necessary libraries and loading our example datasets.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

17.2 Method 1: Merging with merge()

The merge() function is pandas’ most versatile tool for combining datasets. It works similarly to SQL JOIN operations, combining DataFrames based on one or more common columns (called keys).

17.2.1 Basic Syntax

pd.merge(left_df, right_df, on='key_column', how='inner')

Or using the DataFrame method:

left_df.merge(right_df, on='key_column', how='inner')

Let’s see this in action!

17.2.2 Example Datasets: Lord of the Rings Fellowship

Throughout this chapter, we’ll use multiple datasets that contain complementary information, which we’ll combine using different methods.

Let’s first load the primary dataset containing basic fellowship member information:

# Load the primary dataset
df_existing = pd.read_csv('./datasets/LOTR.csv')

print(f"Dataset shape: {df_existing.shape}")
print(f"Columns: {list(df_existing.columns)}\n")

df_existing.head()
Dataset shape: (4, 3)
Columns: ['FellowshipID', 'FirstName', 'Skills']
FellowshipID FirstName Skills
0 1001 Frodo Hiding
1 1002 Samwise Gardening
2 1003 Gandalf Spells
3 1004 Pippin Fireworks

Now let’s load the external dataset containing additional information about the fellowship members:

# Load the external dataset
df_external = pd.read_csv("./datasets/LOTR 2.csv")

print(f"Dataset shape: {df_external.shape}")
print(f"Columns: {list(df_external.columns)}\n")

df_external.head()
Dataset shape: (5, 3)
Columns: ['FellowshipID', 'FirstName', 'Age']
FellowshipID FirstName Age
0 1001 Frodo 50
1 1002 Samwise 39
2 1006 Legolas 2931
3 1007 Elrond 6520
4 1008 Barromir 51

Observation: Both datasets contain information about fellowship members, but they have:

  • Common columns that can serve as keys for merging (FellowshipID and FirstName)
  • Different sets of attributes that complement each other (Skills vs. Age)
  • Different numbers of rows (not all members appear in both datasets)

Our goal is to combine these datasets to create a comprehensive view of the fellowship.

17.2.3 Example 1: Basic Merge with Automatic Key Detection

When both DataFrames have columns with the same name, merge() automatically uses them as keys:

# Merge using automatic key detection
df_merged = pd.merge(df_existing, df_external)

print(f"Original datasets: {df_existing.shape} + {df_external.shape}")
print(f"Merged dataset: {df_merged.shape}")
print(f"\nMerged columns: {list(df_merged.columns)}\n")

df_merged.head()
Original datasets: (4, 3) + (5, 3)
Merged dataset: (2, 4)

Merged columns: ['FellowshipID', 'FirstName', 'Skills', 'Age']
FellowshipID FirstName Skills Age
0 1001 Frodo Hiding 50
1 1002 Samwise Gardening 39

Alternative Syntax: You can also use the DataFrame method, which is more readable in method chains:

# Using DataFrame.merge() method (equivalent to above)
df_merged_method = df_existing.merge(df_external)

print(f"Results are identical: {df_merged.equals(df_merged_method)}")
print(f"Merged dataset shape: {df_merged_method.shape}\n")

df_merged_method.head()
Results are identical: True
Merged dataset shape: (2, 4)
FellowshipID FirstName Skills Age
0 1001 Frodo Hiding 50
1 1002 Samwise Gardening 39

17.2.4 Example 2: Merging with Different Column Names

What if the key columns have different names in each DataFrame? Use the left_on and right_on parameters to specify which columns to match.

# Example: Suppose the external dataset uses 'ID' instead of 'FellowshipID'
# Let's create a version with a different column name to demonstrate
df_external_renamed = df_external.rename(columns={'FellowshipID': 'ID'})

print("External dataset columns after renaming:")
print(list(df_external_renamed.columns))
print()

# Now merge using left_on and right_on
df_merged_diff = pd.merge(
    df_existing, 
    df_external_renamed,
    left_on='FellowshipID',   # Column name in left DataFrame
    right_on='ID'              # Column name in right DataFrame
)

print(f"Merged dataset shape: {df_merged_diff.shape}")
print(f"Merged columns: {list(df_merged_diff.columns)}\n")

df_merged_diff.head()

Important observations:

  • Notice that both key columns (FellowshipID and ID) appear in the result
  • They contain identical values (the matched keys)
  • You can drop one of them after merging: df_merged_diff.drop('ID', axis=1)
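
Applied to the result above, a quick cleanup might look like this:

# Drop the redundant right-hand key column after merging
df_merged_diff = df_merged_diff.drop(columns='ID')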

Common use case: Merging data from different sources where column naming conventions differ (e.g., customer_id vs cust_id, date vs timestamp).

17.2.5 Example 3: Merging on Multiple Columns

Sometimes you need to match rows based on multiple columns together (composite keys). This is common when a single column isn’t unique.

# Example: Merge on both FellowshipID AND FirstName to ensure exact matches
df_multi_key = pd.merge(
    df_existing, 
    df_external,
    on=['FellowshipID', 'FirstName']  # Both columns must match
)

print(f"Merged on multiple keys: {df_multi_key.shape}")
print("\nMerging on multiple columns ensures rows match on ALL specified keys\n")

df_multi_key.head()

Real-world examples of composite keys:

  • Sales transactions: Match on both Store_ID AND Date (same store can have sales on different dates)
  • Student grades: Match on Student_ID AND Course_ID (same student takes multiple courses)
  • Time series data: Match on Location AND Timestamp (multiple locations measured over time)

Syntax variations:

# Using on= (when column names are the same)
pd.merge(df1, df2, on=['col1', 'col2'])

# Using left_on= and right_on= (when column names differ)
pd.merge(df1, df2, left_on=['id', 'date'], right_on=['ID', 'timestamp'])

17.2.6 Understanding Join Types: The how Parameter

By default, merge() performs an inner join, which keeps only rows where the key exists in both DataFrames. However, you can control this behavior using the how parameter.

The Four Join Types:

Join Type   how=      Keeps Rows From                       Use Case
----------  --------  ------------------------------------  ------------------------
Inner       'inner'   Both (intersection)                   Only matching records
Left        'left'    All from left + matches from right    Keep all primary data
Right       'right'   All from right + matches from left    Keep all secondary data
Outer       'outer'   Both (union)                          Keep everything

(Figure: visualization of the four join types.)

Let’s explore each type with examples.

17.2.6.1 Inner Join (Default): Only Matching Rows

An inner join returns only the rows where the key exists in both DataFrames. This is the default behavior.

# Inner join - explicitly specified (same as default)
df_inner = pd.merge(df_existing, df_external, how='inner')

print(f"Inner join result: {df_inner.shape}")
print("Only rows with matching keys in BOTH datasets are kept\n")

df_inner.head()
Inner join result: (2, 4)
Only rows with matching keys in BOTH datasets are kept
FellowshipID FirstName Skills Age
0 1001 Frodo Hiding 50
1 1002 Samwise Gardening 39

When to use inner join:

Use inner join when you need complete information from both datasets for your analysis.

Real-world examples:

  1. Analyzing vaccination impact on COVID infection rates

    • Dataset 1: COVID infection rates by country
    • Dataset 2: Vaccination counts by country
    • Why inner join? You can’t analyze the relationship unless you have BOTH metrics for a country. Countries missing either value can’t contribute to the analysis.
  2. Customer purchase analysis

    • Dataset 1: Customer demographics
    • Dataset 2: Purchase transactions
    • Why inner join? To analyze how demographics relate to purchases, you need both demographic info AND purchase history.
  3. A/B testing results

    • Dataset 1: User group assignments (control/treatment)
    • Dataset 2: Conversion outcomes
    • Why inner join? Only users with both group assignment AND outcome data can be included in the test analysis.

17.2.6.2 Left Join: Keep All Left DataFrame Rows

A left join returns all rows from the left DataFrame and matching rows from the right DataFrame. Non-matching rows from the right will have NaN values.

# Left join - keep all rows from left (existing) dataset
df_left = pd.merge(df_existing, df_external, how='left')

print(f"Left DataFrame shape: {df_existing.shape}")
print(f"Right DataFrame shape: {df_external.shape}")
print(f"Left join result: {df_left.shape}")
print("\nAll rows from LEFT dataset are kept, even if no match in RIGHT\n")

# Show rows with missing values from the right dataset
missing_count = df_left.isnull().any(axis=1).sum()
print(f"Rows with NaN values (no match in right dataset): {missing_count}\n")

df_left.head(10)
Left DataFrame shape: (4, 3)
Right DataFrame shape: (5, 3)
Left join result: (4, 4)

All rows from LEFT dataset are kept, even if no match in RIGHT

Rows with NaN values (no match in right dataset): 2
FellowshipID FirstName Skills Age
0 1001 Frodo Hiding 50.0
1 1002 Samwise Gardening 39.0
2 1003 Gandalf Spells NaN
3 1004 Pippin Fireworks NaN

When to use left join:

Use left join when the left dataset contains your primary data that must be retained, while the right dataset provides supplementary information that may be missing for some records.

Real-world examples:

  1. COVID infection rates + government effectiveness scores

    • Left: COVID infection rates (critical, hard to impute)
    • Right: Government effectiveness scores (can be estimated from GDP, crime rate, etc.)
    • Why left join? Keep all countries with infection rate data. Missing government scores can be imputed later using correlated variables.
  2. Customer demographics + credit card spending

    • Left: Customer demographics (all customers)
    • Right: Credit card transactions (only customers who made purchases)
    • Why left join? Keep all customers, even those with no purchases. Missing spend can be filled with 0 (indicating no transactions).
  3. Employee records + performance reviews

    • Left: All employee records
    • Right: Recent performance reviews (not all employees reviewed yet)
    • Why left join? Maintain complete employee roster. Missing reviews can be handled separately (e.g., Pending status).

Key principle: Use left join when you can’t afford to lose the left dataset’s records, and missing right-side values can be reasonably handled (imputed, filled with defaults, or marked as missing).
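
A minimal sketch of the customer example above (customers, transactions, and the CustomerID/TotalSpend columns are hypothetical; transactions is assumed to be pre-aggregated per customer):

# Keep every customer; treat "no transactions" as zero spend
# (customers, transactions, CustomerID, TotalSpend are hypothetical names)
df_spend = customers.merge(transactions, on='CustomerID', how='left')
df_spend['TotalSpend'] = df_spend['TotalSpend'].fillna(0)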

17.2.6.3 Right Join: Keep All Right DataFrame Rows

A right join is the mirror image of a left join—it returns all rows from the right DataFrame and matching rows from the left DataFrame.

# Right join - keep all rows from right (external) dataset  
df_right = pd.merge(df_existing, df_external, how='right')

print(f"Left DataFrame shape: {df_existing.shape}")
print(f"Right DataFrame shape: {df_external.shape}")
print(f"Right join result: {df_right.shape}")
print("\nAll rows from RIGHT dataset are kept, even if no match in LEFT\n")

df_right.head()
Left DataFrame shape: (4, 3)
Right DataFrame shape: (5, 3)
Right join result: (5, 4)

All rows from RIGHT dataset are kept, even if no match in LEFT
FellowshipID FirstName Skills Age
0 1001 Frodo Hiding 50
1 1002 Samwise Gardening 39
2 1006 Legolas NaN 2931
3 1007 Elrond NaN 6520
4 1008 Barromir NaN 51

When to use right join:

Right join serves the same purpose as left join, just with the DataFrames swapped. In practice:

  • Most analysts prefer left join because it’s more intuitive (reads left-to-right)
  • Right join is rarely used - you can always swap the DataFrames and use left join instead

Example equivalence:

# These produce the same result:
df_right = pd.merge(df_A, df_B, how='right')  # Right join
df_left = pd.merge(df_B, df_A, how='left')    # Equivalent left join

When you might see right join:

  • Legacy code or specific coding style preferences
  • SQL background where RIGHT JOIN might be more familiar
  • Method chaining where the order is determined by workflow

Best practice: Stick with left join and arrange your DataFrames accordingly—it makes code more readable and maintainable.

17.2.6.4 Outer Join: Keep All Rows from Both DataFrames

An outer join (also called a full outer join) returns all rows from both DataFrames, with NaN values where matches don’t exist. This is the most inclusive join type—no data is lost.

# Outer join - keep all rows from BOTH datasets
df_outer = pd.merge(df_existing, df_external, how='outer')

print(f"Left DataFrame shape: {df_existing.shape}")
print(f"Right DataFrame shape: {df_external.shape}")
print(f"Outer join result: {df_outer.shape}")
print("\nAll rows from BOTH datasets are kept, with NaN for non-matches\n")

# Check for missing values
print("Missing values per column:")
print(df_outer.isnull().sum())
print()

df_outer
Left DataFrame shape: (4, 3)
Right DataFrame shape: (5, 3)
Outer join result: (7, 4)

All rows from BOTH datasets are kept, with NaN for non-matches

Missing values per column:
FellowshipID    0
FirstName       0
Skills          3
Age             2
dtype: int64
FellowshipID FirstName Skills Age
0 1001 Frodo Hiding 50.0
1 1002 Samwise Gardening 39.0
2 1003 Gandalf Spells NaN
3 1004 Pippin Fireworks NaN
4 1006 Legolas NaN 2931.0
5 1007 Elrond NaN 6520.0
6 1008 Barromir NaN 51.0

When to use outer join:

Use outer join when you cannot afford to lose any data from either dataset, even if it means having incomplete records.

Real-world examples:

  1. Course surveys from multiple time points

    • Dataset 1: Mid-semester survey responses
    • Dataset 2: End-of-semester survey responses
    • Why outer join? Students who responded to either survey provide valuable feedback. Keep all responses, even from students who only completed one survey. Analyze response patterns and sentiment across all participants.
  2. Multi-source customer data integration

    • Dataset 1: Online purchase records
    • Dataset 2: In-store purchase records
    • Why outer join? Customers may shop through one channel only. Keep all customers to understand total customer base and channel preferences.
  3. Scientific study with partial data

    • Dataset 1: Lab test results
    • Dataset 2: Clinical observations
    • Why outer join? Some subjects may have only lab results or only clinical observations due to scheduling, dropouts, or data collection issues. Keep all subjects to maximize sample size and identify patterns in data availability.
  4. Data quality auditing

    • Dataset 1: Expected records (should exist)
    • Dataset 2: Actual records (what we have)
    • Why outer join? Identify both missing expected records (in left but not right) and unexpected extra records (in right but not left) for data quality assessment.

Key principle: Use outer join when completeness trumps having matched pairs, and you’ll handle missing values appropriately in subsequent analysis.
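
The auditing use case above can be implemented with merge()’s indicator parameter, which adds a _merge column recording where each row came from. Applied to our fellowship data:

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
df_audit = pd.merge(df_existing, df_external, how='outer', indicator=True)
print(df_audit['_merge'].value_counts())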

17.3 Method 2: Joining with join()

While merge() combines DataFrames based on column values, the join() method combines them based on their indices. This is particularly useful when your DataFrames are already indexed by a meaningful key (like customer ID, timestamp, etc.).

17.3.1 Key Difference: merge() vs join()

Aspect         merge()              join()
-------------  -------------------  ------------------
Joins on       Column values        Index values
Default join   Inner                Left
Syntax         More explicit        More concise
Use when       Keys are in columns  Keys are in index
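
Because join() is essentially an index-based merge, the two can be written interchangeably. A sketch, assuming hypothetical DataFrames left and right with a shared, meaningful index and non-overlapping columns:

# These two calls produce the same result:
left.join(right)                                                      # index-based, left join by default
pd.merge(left, right, left_index=True, right_index=True, how='left')  # equivalent merge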

17.3.2 Important: Handling Overlapping Columns

Unlike merge(), join() requires you to specify suffixes for overlapping column names (it has no default). This is done with the lsuffix and rsuffix parameters.

Let’s see what happens without suffixes:
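
# Both DataFrames share the 'FellowshipID' and 'FirstName' column names,
# so this call fails:
df_existing.join(df_external)
# ValueError: columns overlap but no suffix specified:
#     Index(['FellowshipID', 'FirstName'], dtype='object')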

The call above fails because both DataFrames have overlapping column names, and join() doesn’t know how to handle them without suffixes.

Let’s fix this by adding suffixes:

# Join with suffixes to handle overlapping column names
df_joined = df_existing.join(
    df_external, 
    lsuffix='_existing',  # Suffix for left DataFrame columns
    rsuffix='_external'    # Suffix for right DataFrame columns
)

print(f"Joined dataset shape: {df_joined.shape}")
print(f"\nColumns with suffixes:")
print([col for col in df_joined.columns if '_existing' in col or '_external' in col])
print()

df_joined.head()
Joined dataset shape: (4, 6)

Columns with suffixes:
['FellowshipID_existing', 'FirstName_existing', 'FellowshipID_external', 'FirstName_external']
FellowshipID_existing FirstName_existing Skills FellowshipID_external FirstName_external Age
0 1001 Frodo Hiding 1001 Frodo 50
1 1002 Samwise Gardening 1002 Samwise 39
2 1003 Gandalf Spells 1006 Legolas 2931
3 1004 Pippin Fireworks 1007 Elrond 6520

Key Observations:

  1. Default behavior: join() performs a left join by default (keeps all rows from the left DataFrame)
  2. Index-based: Rows are matched using the DataFrame indices (0, 1, 2, …)
  3. Suffixes required: Overlapping column names get the specified suffixes

This works, but notice we’re joining on the default integer index (0, 1, 2, …), which may not be meaningful. Let’s make this more useful by setting a proper index.

17.3.3 Setting a Meaningful Index for Joining

For join() to be truly useful, we should set a meaningful column as the index (like FellowshipID). This way, rows are matched based on actual business logic, not arbitrary row numbers.

# Set 'FellowshipID' as the index for both DataFrames
df_existing_indexed = df_existing.set_index('FellowshipID')
df_external_indexed = df_external.set_index('FellowshipID')

print("Left DataFrame (indexed by Fellowship ID):")
print(df_existing_indexed.head())
print("\nRight DataFrame (indexed by FellowshipID):")
print(df_external_indexed.head())
Left DataFrame (indexed by FellowshipID):
             FirstName     Skills
FellowshipID                     
1001             Frodo     Hiding
1002           Samwise  Gardening
1003           Gandalf     Spells
1004            Pippin  Fireworks

Right DataFrame (indexed by FellowshipID):
             FirstName   Age
FellowshipID                
1001             Frodo    50
1002           Samwise    39
1006           Legolas  2931
1007            Elrond  6520
1008          Barromir    51
# Now join using the FellowshipID index
df_joined_indexed = df_existing_indexed.join(
    df_external_indexed, 
    lsuffix='_existing',
    rsuffix='_external'
)

print(f"Joined dataset shape: {df_joined_indexed.shape}")
print(f"\nIndex name: {df_joined_indexed.index.name}")
print("\nNow rows are matched by FellowshipID, not arbitrary row numbers!\n")

df_joined_indexed
Joined dataset shape: (4, 4)

Index name: FellowshipID

Now rows are matched by FellowshipID, not arbitrary row numbers!
FirstName_existing Skills FirstName_external Age
FellowshipID
1001 Frodo Hiding Frodo 50.0
1002 Samwise Gardening Samwise 39.0
1003 Gandalf Spells NaN NaN
1004 Pippin Fireworks NaN NaN

Perfect! Now the join is meaningful:

  • Rows are matched by FellowshipID (a business key)
  • All rows from the left DataFrame are kept (default left join)
  • NaN values appear where FellowshipID doesn’t exist in the right DataFrame
  • The index preserves the key for easy lookup

17.3.4 Changing Join Type in join()

Just like merge(), you can specify the join type using the how parameter:

# Examples of different join types with join()
print("Inner join:")
print(df_existing_indexed.join(df_external_indexed, how='inner', lsuffix='_L', rsuffix='_R').shape)

print("\nOuter join:")
print(df_existing_indexed.join(df_external_indexed, how='outer', lsuffix='_L', rsuffix='_R').shape)

print("\nRight join:")
print(df_existing_indexed.join(df_external_indexed, how='right', lsuffix='_L', rsuffix='_R').shape)
Inner join:
(2, 4)

Outer join:
(7, 4)

Right join:
(5, 4)

17.4 Method 3: Concatenating with concat()

While merge() and join() combine DataFrames by matching keys or indices, concat() simply stacks DataFrames together either vertically (one on top of another) or horizontally (side by side). Think of it as physically gluing DataFrames together.

17.4.1 The Two Directions: axis Parameter

axis= Direction Effect Use Case
0 (default) Vertical Rows on top of rows Combining data from multiple time periods
1 Horizontal Columns side by side Adding new features/attributes

17.4.2 Concatenating Along Rows (Vertical Stacking)

The default behavior (axis=0) stacks DataFrames vertically—adding more rows.

17.4.2.1 Simple Example with Fellowship Data

Let’s start with a simple example using our LOTR datasets to understand the basics:

# Split our LOTR data into two groups to demonstrate concatenation
# (reset the second group's index to mimic two independently loaded files)
lotr_df1 = df_existing.head(2)                         # First two fellowship members
lotr_df2 = df_existing.tail(2).reset_index(drop=True)  # Last two fellowship members

print("First group:")
print(lotr_df1)
print("\nSecond group:")
print(lotr_df2)
# Concatenate vertically (default: axis=0) - stack rows
lotr_combined = pd.concat([lotr_df1, lotr_df2])

print(f"First group: {lotr_df1.shape}")
print(f"Second group: {lotr_df2.shape}")
print(f"Combined: {lotr_combined.shape}")
print("\nRows are simply stacked on top of each other:\n")

lotr_combined

Key points about vertical concatenation:

  • Rows are simply stacked (group 1 rows, then group 2 rows)
  • Index values may be duplicated (each group is numbered from 0, so indices 0 and 1 appear twice)
  • All columns from both DataFrames are kept
  • No key matching occurs—just physical stacking

Use ignore_index=True to create a new sequential index:

# Reset index to create sequential numbering
lotr_combined_reset = pd.concat([lotr_df1, lotr_df2], ignore_index=True)

print("With ignore_index=True, we get a clean sequential index:\n")
lotr_combined_reset

17.4.2.2 Real-World Example: Combining Continental Data

Now let’s see a practical real-world use case: combining GDP and life expectancy data that’s been split across multiple files by continent. This is a common scenario when data is stored in separate files for organizational purposes.

data_asia = pd.read_csv('./datasets/gdp_lifeExpec_Asia.csv')
data_europe = pd.read_csv('./datasets/gdp_lifeExpec_Europe.csv')
data_africa = pd.read_csv('./datasets/gdp_lifeExpec_Africa.csv')
data_oceania = pd.read_csv('./datasets/gdp_lifeExpec_Oceania.csv')
data_americas = pd.read_csv('./datasets/gdp_lifeExpec_Americas.csv')

# Append all the data files, i.e., stack them on top of each other
data_all_continents = pd.concat(
    [data_asia, data_europe, data_africa, data_oceania, data_americas],
    keys=['Asia', 'Europe', 'Africa', 'Oceania', 'Americas']
)
data_all_continents
country year lifeExp pop gdpPercap
Asia 0 Afghanistan 1952 28.801 8425333 779.445314
1 Afghanistan 1957 30.332 9240934 820.853030
2 Afghanistan 1962 31.997 10267083 853.100710
3 Afghanistan 1967 34.020 11537966 836.197138
4 Afghanistan 1972 36.088 13079460 739.981106
... ... ... ... ... ... ...
Americas 295 Venezuela 1987 70.190 17910182 9883.584648
296 Venezuela 1992 71.150 20265563 10733.926310
297 Venezuela 1997 72.146 22374398 10165.495180
298 Venezuela 2002 72.766 24287670 8605.047831
299 Venezuela 2007 73.747 26084662 11415.805690

1704 rows × 5 columns

Let’s turn the continent labels into a regular column, since we’ll need them for the visualization.

data_all_continents.reset_index(inplace=True)
data_all_continents.head()
level_0 level_1 country year lifeExp pop gdpPercap
0 Asia 0 Afghanistan 1952 28.801 8425333 779.445314
1 Asia 1 Afghanistan 1957 30.332 9240934 820.853030
2 Asia 2 Afghanistan 1962 31.997 10267083 853.100710
3 Asia 3 Afghanistan 1967 34.020 11537966 836.197138
4 Asia 4 Afghanistan 1972 36.088 13079460 739.981106
data_all_continents.drop(columns='level_1', inplace=True)
data_all_continents.rename(columns={'level_0': 'continent'}, inplace=True)
data_all_continents.head()
continent country year lifeExp pop gdpPercap
0 Asia Afghanistan 1952 28.801 8425333 779.445314
1 Asia Afghanistan 1957 30.332 9240934 820.853030
2 Asia Afghanistan 1962 31.997 10267083 853.100710
3 Asia Afghanistan 1967 34.020 11537966 836.197138
4 Asia Afghanistan 1972 36.088 13079460 739.981106
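
As an aside, the rename above can be avoided by naming the key level at concatenation time; a minimal sketch (names= labels the levels of the resulting hierarchical index):

# Name the outer key level 'continent' up front, so no rename is needed
data_all = pd.concat(
    [data_asia, data_europe, data_africa, data_oceania, data_americas],
    keys=['Asia', 'Europe', 'Africa', 'Oceania', 'Americas'],
    names=['continent']
)
data_all = data_all.reset_index(level='continent')  # 'continent' becomes a column
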
# Change of life expectancy over time for different continents
a = sns.FacetGrid(data_all_continents, col='continent', col_wrap=3, height=4.5, aspect=1)
a.map(sns.lineplot, 'year', 'lifeExp')
a.add_legend();

In the above example, the datasets were appended (stacked on top of each other).

Datasets can also be concatenated side by side by passing axis=1 to the concat() function. Note that, unlike merge(), the alignment then happens on the index rather than on key columns.

17.4.3 Horizontal Concatenation (Combining Columns)

Setting axis=1 combines DataFrames side-by-side by aligning their indices.

# Concatenate horizontally (axis=1) - combine columns
result = pd.concat([df_existing, df_external], axis=1)
result
FellowshipID FirstName Skills FellowshipID FirstName Age
0 1001.0 Frodo Hiding 1001 Frodo 50
1 1002.0 Samwise Gardening 1002 Samwise 39
2 1003.0 Gandalf Spells 1006 Legolas 2931
3 1004.0 Pippin Fireworks 1007 Elrond 6520
4 NaN NaN NaN 1008 Barromir 51

Notice:

  • DataFrames are placed side-by-side
  • Rows are aligned by index value (here the default integer index, so rows are matched by position, not by FellowshipID; note that Gandalf’s row is paired with Legolas’s)
  • Non-matching indices result in NaN values (index 4 exists only in the right DataFrame)
  • This is similar to an outer join on the index

When to use horizontal concat:

  • Adding new features/columns to existing data
  • Combining results from different analyses on the same observations
  • Building feature matrices for machine learning (each DataFrame contains different feature sets)

17.4.3.1 Handling Duplicate Column Names

When concatenating horizontally with duplicate column names, pandas keeps both columns:

# Inspect the column names: the duplicated 'FellowshipID' and 'FirstName' remain
result.columns
Index(['FellowshipID', 'FirstName', 'Skills', 'FellowshipID', 'FirstName',
       'Age'],
      dtype='object')

Both FellowshipID and FirstName appear twice, which can be confusing. You can use the keys parameter to create hierarchical column names:

# Use keys to label which DataFrame each column came from
result_labeled = pd.concat([df_existing, df_external], axis=1, keys=['Fellowship', 'Extended'])
result_labeled
Fellowship Extended
FellowshipID FirstName Skills FellowshipID FirstName Age
0 1001.0 Frodo Hiding 1001 Frodo 50
1 1002.0 Samwise Gardening 1002 Samwise 39
2 1003.0 Gandalf Spells 1006 Legolas 2931
3 1004.0 Pippin Fireworks 1007 Elrond 6520
4 NaN NaN NaN 1008 Barromir 51

Now you can clearly see which DataFrame each column came from. This creates a MultiIndex for columns (we’ll cover this in detail in future chapters).
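
With the keys in place, each source’s columns can be selected as a group:

# Select only the columns that came from the external dataset
result_labeled['Extended'].head()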

17.5 Summary: Choosing the Right Method

Now that we’ve covered all three methods for combining DataFrames, here’s how to choose:

Method    Best For                                 Key Feature                                    Common Pitfall
--------  ---------------------------------------  ---------------------------------------------  ---------------------------------------------------
merge()   Combining on shared column(s)            Flexible join types (inner/left/right/outer)   Forgetting to specify join type (defaults to inner)
join()    Combining on index values                Quick syntax for index-based joins             Index must be set up properly beforehand
concat()  Stacking DataFrames with same structure  Simple vertical/horizontal stacking            Doesn’t match on keys (just stacks/aligns)

17.5.1 Decision Tree:

  1. Do you need to match rows based on values in columns?

    • Yes → Use merge() (specify join type and on= parameter)
  2. Do you need to match rows based on index values?

    • Yes → Use join() (ensure indices are meaningful)
  3. Do you just need to stack DataFrames?

    • Vertically (add more rows) → Use concat(axis=0)
    • Horizontally (add more columns) → Use concat(axis=1) (aligns on index)

17.6 Handling Missing Values After Combining Datasets

When you combine datasets using merge(), join(), or concat(), you often introduce missing values (NaN) for unmatched entries. Understanding why these appear and how to handle them is crucial for maintaining data quality.

17.6.1 Why Missing Values Occur

Missing values appear for different reasons depending on the join type:

Join Type       When NaN Appears                             Example
--------------  -------------------------------------------  ---------------------------------------------------------------
Left Join       Right DataFrame has no match for left keys   Customer exists but has no purchases → purchase columns = NaN
Right Join      Left DataFrame has no match for right keys   Purchase exists but customer deleted → customer columns = NaN
Outer Join      Either DataFrame has no match                Missing data from both sides
concat(axis=1)  Indices don’t align                          Different row indices → NaN where no alignment

Inner joins never produce NaN from the join operation itself (only matched rows are kept).
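
We can verify this with the inner-join result from earlier:

# The inner join from Section 17.2.6.1 introduced no missing values
print(df_inner.isnull().sum())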

17.6.2 Strategies for Handling Missing Values

The right strategy depends on why the data is missing and what you’re analyzing.

17.6.2.1 Strategy 1: Keep the Missing Values

When to use: Missing values are meaningful and represent absence of something.

Example: Customer purchase analysis with left join

# All customers + their purchases (if any)
df_customer_purchases = customers.merge(purchases, on='CustomerID', how='left')

Interpretation:

  • NaN in purchase columns = Customer hasn’t made a purchase yet
  • This is informative - these customers might be new or inactive
  • Analysis: answering “What percentage of customers have never purchased?” requires keeping NaN

Action: Keep NaN and analyze it as a category (e.g., df['Purchase'].isna().sum())
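
For instance, a sketch answering the question above (PurchaseID is a hypothetical column that is NaN for customers without purchases):

# Percentage of customers who never made a purchase
pct_never = df_customer_purchases['PurchaseID'].isna().mean() * 100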

17.6.2.2 Strategy 2: Fill with Default Values

When to use: Missing values should be interpreted as zero or a default state.

Example: Sales data where no transaction = zero sales

# Fill missing sales with 0
df_merged['TotalSales'] = df_merged['TotalSales'].fillna(0)
df_merged['TransactionCount'] = df_merged['TransactionCount'].fillna(0)

Common fill values:

  • 0 - for counts, amounts, quantities
  • 'None' or 'Unknown' - for categorical data
  • False - for boolean flags
  • Mean/median - for numerical features (use with caution!)

Caution: Only fill when you’re certain what the missing value means.

17.6.2.3 Strategy 3: Drop Rows with Missing Values

When to use: Analysis requires complete information from both datasets.

Example: Analyzing relationship between two variables (need both)

# Drop rows where either GDP or Population is missing
df_complete = df_merged.dropna(subset=['GDP', 'Population'])

Options:

  • dropna() - drop rows with any NaN
  • dropna(subset=['col1', 'col2']) - drop only if specific columns have NaN
  • dropna(thresh=5) - drop only if fewer than 5 non-NaN values

Caution: You’re losing data! Document how many rows were dropped and why.

17.6.2.4 Strategy 4: Use a Different Join Type

When to use: You’re getting NaN because you used the wrong join type!

Example: Used outer join but only need matching rows

# Wrong: Creates many NaN values
df_outer = pd.merge(df1, df2, how='outer')  # 1000 rows, 50% NaN

# Better: Use inner join to keep only complete matches
df_inner = pd.merge(df1, df2, how='inner')  # 600 rows, 0% NaN

Before handling missing values, ask:

  1. Did I use the right join type for my analysis question?
  2. Should I actually be using inner join instead of left/outer?
  3. Are the missing values telling me about data quality issues?

17.6.3 Practical Example: Handling Missing Values

Let’s see these strategies in action with our fellowship data:

# Create an outer join to see missing values
df_outer_example = pd.merge(df_existing, df_external, how='outer')

print("Outer join result:")
print(f"Total rows: {len(df_outer_example)}")
print(f"\nMissing values per column:")
print(df_outer_example.isnull().sum())
print(f"\nRows with ANY missing value: {df_outer_example.isnull().any(axis=1).sum()}")

df_outer_example

Now let’s apply different strategies:

# First, identify WHERE the missing values are
print("Rows with missing 'Skills' (members only in the external dataset):")
print(df_outer_example[df_outer_example['Skills'].isna()][['FirstName', 'Skills', 'Age']])
print()

print("Rows with missing 'Age' (members only in the existing dataset):")
print(df_outer_example[df_outer_example['Age'].isna()][['FirstName', 'Skills', 'Age']])
# Strategy: Drop incomplete rows (for complete-case analysis)
df_complete = df_outer_example.dropna()

print(f"Original rows: {len(df_outer_example)}")
print(f"Complete rows: {len(df_complete)}")
print(f"Rows dropped: {len(df_outer_example) - len(df_complete)}")
print()
df_complete
# Strategy: Fill with meaningful defaults
df_filled = df_outer_example.copy()
df_filled['Skills'] = df_filled['Skills'].fillna('Unknown')

print("After filling missing 'Skills' with 'Unknown':")
print(df_filled[['FirstName', 'Skills']])
print(f"\nMissing 'Skills' values: {df_filled['Skills'].isna().sum()}")

17.6.4 Best Practices for Missing Value Handling

1. Investigate First

# Always check WHERE and WHY values are missing
df.isnull().sum()  # Count per column
df[df['column'].isna()]  # See the actual rows

2. Document Your Decisions

# Good: Document what you did and why
# Filled missing GDP with 0 because these are countries with no economic data reported
df['GDP'] = df['GDP'].fillna(0)

# Also good: Create a flag for missing values before filling
df['GDP_was_missing'] = df['GDP'].isna()
df['GDP'] = df['GDP'].fillna(df['GDP'].median())

3. Consider the Impact

  • Dropping rows: Reduces sample size, may introduce bias
  • Filling with defaults: May distort statistics (mean, correlations)
  • Keeping NaN: Requires careful handling in analysis/visualization

4. Choose Join Type Carefully

The best way to handle missing values is often to prevent them by using the right join type for your analysis question!

Remember: Missing values after combining datasets are expected and normal. The key is understanding what they mean and handling them appropriately for your specific analysis goals.

17.7 Independent Study

17.7.1 Merging GDP per capita and population datasets

In this independent study on data reshaping, we have provided two datasets:
  • gdp_per_capita_data
  • population_data

This exercise builds on the results from the previous section.

Task:
Merge gdp_per_capita_data with population_data to combine each country’s population and GDP per capita into a single DataFrame.
Finally, print the first two rows of the merged DataFrame.

Assume that:

  1. We want to keep the GDP per capita of all countries in the merged dataset, even if their population is unavailable in the population dataset. For countries whose population is unavailable, the Population column will show NA.

  2. We want to discard an observation of a country if its GDP per capita is unavailable.

17.7.1.1

For how many countries in gdp_per_capita_data does the population seem to be unavailable in population_data? Note that you don’t need to clean country names any further than the provided functions already do.

Print the observations of gdp_per_capita_data with missing Population.

17.7.2 Merging datasets with similar values in the key column

We suspect that the population of more countries may be available in population_data. However, due to unclean country names, the observations could not merge. For example, the country Guinea Bissau is mentioned as GuineaBissau in gdp_per_capita_data and Guinea-Bissau in population_data. To resolve this issue, we’ll use a different approach to merging the datasets: for each observation in the GDP per capita dataset, we’ll merge in the population of the country whose name in population_data is most similar to the country’s name in gdp_per_capita_data.

17.7.2.1

Proceed as follows:

  1. For each country in gdp_per_capita_data, find the country with the most similar name in population_data, based on the similarity score. Use the lambda function provided below to compute the similarity score between two strings (the higher the score, the more similar the strings; the similarity score is 1.0 if the two strings are exactly the same).

  2. Merge the population of the most similar country to the country in gdp_per_capita_data. The merged dataset must include 5 columns - the country name as it appears in gdp_per_capita_data, the GDP per capita, the country name of the most similar country as it appears in population_data, the population of that country, and the similarity score between the country names.

  3. After creating the merged dataset, print the rows of the dataset that have similarity scores less than 1.

Use the function below to compute the similarity score between the Country names of the two datasets:

from difflib import SequenceMatcher

similar = lambda a,b: SequenceMatcher(None, a, b).ratio()
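
For example, the two spellings of Guinea-Bissau mentioned earlier score:

similar('GuineaBissau', 'Guinea-Bissau')  # 2*12/(12+13) = 0.96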

Note: You may use at most one for loop for this particular question. However, it would be ideal if you don’t use a for loop at all.

Hint:

  1. Define a function that computes the index of the observation having the most similar country name in population_data for an observation in gdp_per_capita_data. The function returns a Series consisting of the most similar country name, its population, and its similarity score (This function can be written with only one line in its body, excluding the return statement and the definition statement. However, you may use as many lines as you wish).

  2. Apply the function on the Country column of gdp_per_capita_data. A DataFrame will be obtained.

  3. Concatenate the DataFrame obtained in (2) with gdp_per_capita_data with the pandas concat() function.

17.7.2.2

In the dataset obtained in the previous question, for all observations where the similarity score is less than 0.8, replace the population with NaN.

Print the observations of the dataset having missing values of population.